26.1 Antalya port - Alternative syntax for cluster functions#1390

Open
ianton-ru wants to merge 27 commits into antalya-26.1 from frontport/antalya-26.1/alternative_syntax

Conversation

@ianton-ru commented Feb 11, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Frontports for Antalya 26.1

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions bot commented Feb 11, 2026

Workflow [PR], commit [dea01b5]

@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5b1756a56

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

auto arguments = make_intrusive<ASTExpressionList>();

arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.storage_account_url));
arguments->children.push_back(make_intrusive<ASTIdentifier>(connection_params.endpoint.container_name));

P1: Emit Azure container argument as a string literal

When StorageObjectStorageCluster rewrites Azure table-engine reads into table-function calls (for object_storage_cluster/cluster-mode execution), this argument builder emits the container as ASTIdentifier instead of ASTLiteral. Azure table-function parsing expects a literal container name, so distributed reads can fail with argument/identifier errors on remote nodes. This should serialize container_name as a literal string to keep rewritten queries valid.
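The identifier-vs-literal distinction can be sketched outside ClickHouse: a serializer that emits one argument as a bare token produces a query the remote parser reads as a name, not a string. A minimal Python sketch; the quoting helpers and the azureBlobStorage call shape are illustrative assumptions, not the actual AST API:

```python
def as_literal(value: str) -> str:
    # A string literal round-trips through the parser as a constant value.
    return "'" + value.replace("'", "\\'") + "'"

def as_identifier(value: str) -> str:
    # An identifier is a bare token; the remote parser resolves it as a name.
    return value

url, container = "https://acc.blob.core.windows.net", "my-container"
good = f"azureBlobStorage({as_literal(url)}, {as_literal(container)})"
bad = f"azureBlobStorage({as_literal(url)}, {as_identifier(container)})"
# In `bad` the container is unquoted, so a remote node parses my-container
# as an identifier (the hyphen even splits it into an expression).
```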



updateBeforeRead(context);
auto cluster = getCluster(context);
auto cluster = getClusterImpl(context, cluster_name_from_settings, context->getSettingsRef()[Setting::object_storage_max_nodes]);

P2: Limit host capping to object-storage cluster engines only

This unconditionally applies object_storage_max_nodes in base IStorageCluster::read, so non-object-storage engines that also inherit IStorageCluster (notably StorageURLCluster and StorageFileCluster) will also have shards silently dropped when the setting is non-zero. That can return partial query results for urlCluster/fileCluster even though the setting is documented for object-storage cluster functions.
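The gating the comment asks for can be sketched as a small predicate; the function and parameter names here are hypothetical, not from the patch:

```python
def effective_hosts(hosts, max_nodes, is_object_storage):
    # Cap the host list only for object-storage engines; urlCluster and
    # fileCluster keep the full list even when the setting is non-zero.
    if is_object_storage and 0 < max_nodes < len(hosts):
        return hosts[:max_nodes]
    return hosts
```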


Comment on lines +445 to +449
auto p = function_to_cluster_function.find(table_function->name);
if (p == function_to_cluster_function.end())
{
throw Exception(
ErrorCodes::LOGICAL_ERROR,

P2: Handle deltaLakeLocal in cluster-function rewrite

In cluster mode for table engines, DeltaLakeLocal is rewritten to table function deltaLakeLocal, but this conversion path only maps functions that have *Cluster equivalents and then throws LOGICAL_ERROR when none is found. As a result, reads of ENGINE=DeltaLakeLocal with non-empty object_storage_cluster fail with an internal error instead of a supported path or a proper user-facing validation error.
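One way to avoid the internal error is a lookup with a fallback instead of a hard throw. A minimal sketch with an illustrative mapping (the real table contains many more entries):

```python
function_to_cluster_function = {"s3": "s3Cluster", "iceberg": "icebergCluster"}

def rewrite_for_cluster(name):
    # Functions without a *Cluster equivalent (e.g. deltaLakeLocal) fall back
    # to their own name instead of triggering an internal LOGICAL_ERROR.
    return function_to_cluster_function.get(name, name)
```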


@ianton-ru added labels antalya, antalya-26.1, port-antalya (PRs to be ported to all new Antalya releases) on Feb 13, 2026
@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 6048e5d947


#if USE_AVRO
using TableFunctionIcebergCluster = TableFunctionObjectStorageCluster<IcebergClusterDefinition, StorageIcebergConfiguration, true>;
using TableFunctionIcebergLocalCluster = TableFunctionObjectStorageCluster<IcebergLocalClusterDefinition, StorageLocalIcebergConfiguration, true>;

P1: Remove duplicate IcebergLocal cluster type alias

This header now declares TableFunctionIcebergLocalCluster twice under the same #if USE_AVRO guard, which creates a redefinition in AVRO-enabled builds and breaks compilation of the table-function targets. Keeping a single alias declaration avoids a hard build failure for Iceberg-capable configurations.


Comment on lines +331 to +333
auto p = engine_to_function.find(storage_engine_name);
if (p == engine_to_function.end())
{

P2: Add S3-alias mappings before throwing on engine rewrite

This lookup throws when storage_engine_name is not in engine_to_function, but the rewrite path is also used for GCS, COSN, and OSS engines (their storage engine names are defined separately in StorageObjectStorageDefinitions.h). With object_storage_cluster enabled, queries against those engines now fail with LOGICAL_ERROR instead of executing, so these aliases need explicit mapping or a fallback in this conversion step.


{
chassert(result.count == 0); /// We shouldn't use replacement with masking other arguments
result.start = url_arg_idx;
result.start = function->arguments->getRealIndex(url_arg_idx);

P2: Align SAS masking index with skipped-argument offsets

After introducing skipped arguments (used for iceberg(..., storage_type=...)), Azure secret masking must always map back to original argument positions; this branch was updated for AccountKey, but the SharedAccessSignature branch below still writes a relative index. In queries like iceberg(storage_type='azure', '...SharedAccessSignature=...'), that mismatch masks the wrong argument and can leave the SAS token unredacted.
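The required remapping is mechanical: step a relative index back over every skipped position. A minimal sketch of the idea, with a hypothetical helper name (the real code uses `getRealIndex`):

```python
def real_index(rel_idx, skipped):
    # Map an index into the filtered argument list back to the position in
    # the original argument list, stepping over skipped key-value arguments
    # such as storage_type=... in iceberg(...).
    idx = rel_idx
    for s in sorted(skipped):
        if s <= idx:
            idx += 1
    return idx
```

For iceberg(storage_type='azure', url, container, sas_string), the SAS string is relative index 2 after skipping argument 0, but the original position that must be masked is 3.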


@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: f3c7d36056

Comment on lines 420 to 423
auto args_copy = args;
// Remove cluster name from args to avoid confusing cluster name and named collection name
args_copy.erase(args_copy.begin());
configuration->extractDynamicStorageType(args_copy, context, &object_storage_type_arg);

P1: Extract dynamic storage_type from live args

For old *Cluster syntax (for example icebergCluster(..., storage_type='azure')), this code extracts storage_type from args_copy instead of args, so the original argument remains in args and then gets appended again via object_storage_type_arg later in updateQueryToSendIfNeeded. The forwarded query can end up with two storage_type key-value arguments, and worker-side parsing rejects it with the duplicate-parameter error, which breaks distributed reads that use explicit storage_type.
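The copy-vs-original pitfall is easy to reproduce in isolation: popping the key-value argument out of a sliced copy leaves it in the original list, so a later re-append duplicates it. A minimal Python sketch with illustrative argument shapes (the real code operates on ASTs, not tuples):

```python
def extract_storage_type(args):
    # Pop the storage_type key-value argument from the given list in place.
    for i, arg in enumerate(args):
        if isinstance(arg, tuple) and arg[0] == "storage_type":
            return args.pop(i)[1]
    return None

args = ["my_cluster", "http://bucket/path", ("storage_type", "azure")]
args_copy = args[1:]                 # slicing makes a copy
storage_type = extract_storage_type(args_copy)
# The copy was mutated, but the original still carries storage_type, so a
# later step that re-appends the extracted value produces a duplicate.
```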


assert int(hosts_engine_distributed) == 3


def test_distributed_s3_table_engine(started_cluster):

P3: Remove duplicate test_distributed_s3_table_engine

This module defines test_distributed_s3_table_engine twice; in Python the later definition overwrites the earlier one at import time, so one full test body is never executed. That silently drops intended coverage and can hide regressions in this path while CI still reports green.
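Python's rebinding behavior makes this failure mode silent, as a two-line sketch shows:

```python
def test_distributed_s3_table_engine():
    return "first body"

def test_distributed_s3_table_engine():
    # The later definition silently rebinds the name at import time, so the
    # first test body is never collected or executed by pytest.
    return "second body"
```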


@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 8ffa292f43

#endif

#if USE_AVRO
using TableFunctionIcebergCluster = TableFunctionObjectStorageCluster<IcebergClusterDefinition, StorageIcebergConfiguration, true>;

P1: Route icebergCluster masking through storage_type-aware logic

TableFunctionIcebergCluster now uses StorageIcebergConfiguration, so icebergCluster(..., storage_type='azure', ...) can carry Azure secrets, but FunctionSecretArgumentsFinder::findOrdinaryFunctionSecretArguments still classifies icebergCluster under the S3 branch (findS3FunctionSecretArguments(true)) rather than the new Iceberg storage-type-aware masking path. In cluster queries this can leave account_key/connection-string secrets visible in query text and system.query_log (while masking the wrong positional argument), so the cluster Iceberg variant needs the same storage-type dispatch that was added for iceberg.


@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: df401f6c5e

UInt32 shard_num = 0;
for (auto & shard_info : shards_info)
{
addresses_with_failover_.push_back(addresses_with_failover[shard_info.shard_num - 1]);

P1: Preserve address mapping when capping cluster host count

constrainShardInfoAndAddressesToMaxHosts indexes addresses_with_failover with shard_info.shard_num - 1, but in replicas-as-shards mode shard_num comes from the original shard id (not the selected replica position) and can repeat across replicas. When object_storage_max_nodes is used on a cluster that has replicas, this remaps selected shards to wrong/duplicate addresses and can desynchronize shards_info from addresses_with_failover, so distributed object-storage reads are sent to unintended hosts and lose expected parallelism.


@ianton-ru (Author)

constrainShardInfoAndAddressesToMaxHosts is called from the Cluster constructor with ReplicasAsShardsTag, so each node is in a separate shard and no shard has more than one replica.

auto storage_type = findIcebergStorageType(named_collection_pos);

if (storage_type == "s3")
findS3FunctionSecretArguments(false);

P1: Use cluster argument offset for icebergCluster secret masking

For icebergCluster(...), this path always calls the non-cluster masking helpers, so argument indexing starts at position 0 instead of skipping the cluster name. With positional credentials (for example S3-style access key + secret), the code masks the wrong argument and can leave the actual secret key visible in logs/query text, which is a sensitive-data regression specific to icebergCluster queries.
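The off-by-one is a pure offset problem: cluster variants prepend the cluster name, shifting every positional argument by one. A minimal sketch with a hypothetical helper name:

```python
def secret_arg_index(is_cluster_function):
    # With positional S3-style credentials (url, access_key_id,
    # secret_access_key), the secret sits at index 2; *Cluster variants
    # prepend the cluster name, shifting every positional argument by one.
    offset = 1 if is_cluster_function else 0
    return offset + 2
```

Masking index 2 in icebergCluster(cluster, url, key, secret) would redact the access key while leaving the secret visible, which matches the regression described above.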


@ianton-ru

@codex review

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Reviewed commit: 0da7fa3b9e

Comment on lines +1025 to +1028
arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::access_key_id].value));
arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::secret_access_key].value));
if (!s3_settings->auth_settings[S3AuthSetting::session_token].value.empty())
arguments->children.push_back(make_intrusive<ASTLiteral>(s3_settings->auth_settings[S3AuthSetting::session_token].value));

P1: Forward full S3 auth options in distributed rewrites

StorageObjectStorageCluster::updateQueryForDistributedEngineIfNeeded builds remote table-function calls from createArgsWithAccessData(), but this serializer only emits URL plus key/secret/session/NOSIGN fields. Table-level S3 auth passed via extra_credentials (for example role_arn/role_session_name) and related auth options are not forwarded, so queries against S3 engine tables that rely on those settings can fail on worker nodes after rewrite because they execute with different credentials.


Comment on lines +827 to +831
arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.storage_account_url));
arguments->children.push_back(make_intrusive<ASTLiteral>(connection_params.endpoint.container_name));
arguments->children.push_back(make_intrusive<ASTLiteral>(blob_path.path));
if (account_name && account_key)
{

P1: Preserve Azure auth context when serializing engine args

The rewritten Azure arguments only include storage_account_url, container, blob path, and optional account key pair, but omit other parsed auth state (notably SAS/workload-identity details). Since distributed engine reads are rewritten through this serializer, Azure tables configured with SAS URL auth or explicit workload identity credentials lose required auth data on remote nodes and can fail with authorization errors.


ianton-ru and others added 6 commits February 20, 2026 09:14

  • 25.8 Antalya ports: Improvement for Iceberg REST catalog (…_rest_warehouses)
  • Timezone for iceberg timestamptz (…for_iceberg_timestamptz)
  • More profile metrics for Iceberg, S3 and Azure (…etrics)
  • Documentation for Swarm features in Antalya branch